How to Find Outliers for Normal Distributions

Outliers are extreme data points that stand apart from the majority of a dataset. In normal distributions, identifying outliers is essential for understanding data variability and detecting unusual events. This section explains how to determine whether a data point is significantly low, significantly high, or a potential outlier based on its relationship to the mean and standard deviation.

Outliers

What is an Outlier?

Let $\mu$ and $\sigma$ be the mean and standard deviation of a population, and let $x$ be any random value sampled from the population.

We say that $x$ is significantly low if the value of $x$ is less than or equal to $\mu-2\cdot\sigma$.
We say that $x$ is significantly high if the value of $x$ is greater than or equal to $\mu+2\cdot\sigma$.
We say that $x$ is a potential outlier if it is significantly low or significantly high.
We say that $x$ is not significant if it is not a potential outlier.
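The four definitions above can be expressed as a short program. The following is a minimal sketch, not a standard library routine: the function names `classify` and `is_potential_outlier` and the sample population ($\mu = 100$, $\sigma = 10$) are hypothetical and chosen only for illustration.

```python
def classify(x, mu, sigma):
    """Label x using the two-standard-deviation guideline from the definition."""
    low_cutoff = mu - 2 * sigma    # significantly low at or below this value
    high_cutoff = mu + 2 * sigma   # significantly high at or above this value
    if x <= low_cutoff:
        return "significantly low"
    if x >= high_cutoff:
        return "significantly high"
    return "not significant"


def is_potential_outlier(x, mu, sigma):
    """A potential outlier is any value that is significantly low or high."""
    return classify(x, mu, sigma) != "not significant"


# Hypothetical population with mu = 100 and sigma = 10 (cutoffs 80 and 120).
print(classify(80, 100, 10))    # significantly low (equal to the low cutoff)
print(classify(100, 100, 10))   # not significant
print(classify(120, 100, 10))   # significantly high (equal to the high cutoff)
```

Note that the comparisons use `<=` and `>=` because the definition includes values that fall exactly on a cutoff.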

Important Note

What counts as an outlier depends on the data set, which is why our definition says potential outlier rather than outlier. These rules are general guidelines for what is commonly accepted as an outlying or extreme data point. We will be refining this definition all semester.

Example: Determining Significance Using Standard Deviation

A university tracks the number of hours students study per week. A recent study found that the mean study time per week is $\mu = 15$ hours with a standard deviation of $\sigma = 4$ hours. One student reports studying 25 hours in a week. Another student reports studying only 5 hours in a week. Using the criteria that a value is significantly high if it is greater than or equal to $\mu + 2\sigma$ and significantly low if it is less than or equal to $\mu - 2\sigma$, determine whether these study times are significant.

Solution

To determine significance, we calculate the thresholds:

Significantly high threshold: \[ \mu + 2\sigma = 15 + 2(4) = 15 + 8 = 23 \]

Significantly low threshold: \[ \mu - 2\sigma = 15 - 2(4) = 15 - 8 = 7 \]

Now, compare the given values:

$25 > 23$, so studying 25 hours is considered significantly high.
$5 < 7$, so studying 5 hours is considered significantly low.

Conclusion: The student who studied 25 hours had a significantly high study time, while the student who studied 5 hours had a significantly low study time.
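The arithmetic in this solution can be double-checked with a few lines of code. This is only an illustrative sketch; the variable names (`low_cutoff`, `high_cutoff`, `results`) are assumptions made for this example.

```python
mu, sigma = 15, 4                # mean and standard deviation, in hours

low_cutoff = mu - 2 * sigma      # 15 - 8 = 7
high_cutoff = mu + 2 * sigma     # 15 + 8 = 23

results = {}
for hours in (25, 5):
    if hours >= high_cutoff:
        results[hours] = "significantly high"
    elif hours <= low_cutoff:
        results[hours] = "significantly low"
    else:
        results[hours] = "not significant"

print(low_cutoff, high_cutoff)   # 7 23
print(results)                   # {25: 'significantly high', 5: 'significantly low'}
```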

$$\tag*{$\blacksquare$}$$

Example: Classifying Unemployment Rates

In the United States, unemployment rates follow a normal distribution with a mean of 5.7% and a standard deviation of 1.9%. Determine if the following unemployment rates are significantly high, significantly low, or not significant.

Part A: An unemployment rate of 8.2%.
Part B: An unemployment rate of 1.6%.
Part C: An unemployment rate of 12.4%.

Note: you do NOT need to convert % values in this problem to decimals since you are not taking a percent of any particular values.

Solution

The thresholds for determining if an unemployment rate is significantly high or significantly low are calculated using the following formulas:

Significantly Low: Less than or equal to $5.7\% - 2 \times 1.9\% = 5.7\% - 3.8\% = 1.9\%$
Significantly High: Greater than or equal to $5.7\% + 2 \times 1.9\% = 5.7\% + 3.8\% = 9.5\%$

Part A: 8.2%
Since 8.2% is greater than 1.9% but less than 9.5%, it is not significant.
Part B: 1.6%
Since 1.6% is less than 1.9%, it is considered significantly low.
Part C: 12.4%
Since 12.4% is greater than 9.5%, it is considered significantly high.
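When several values share the same mean and standard deviation, the comparison can be repeated in a loop. The sketch below assumes the rates are kept as percentages (no conversion to decimals), and the helper name `label_rate` is hypothetical. Because these are floating-point numbers, a rate lying exactly on a cutoff could be affected by rounding; none of the three rates here is that close.

```python
mu, sigma = 5.7, 1.9             # mean and standard deviation, in percent

low_cutoff = mu - 2 * sigma      # approximately 1.9
high_cutoff = mu + 2 * sigma     # approximately 9.5


def label_rate(rate):
    """Classify one unemployment rate (in percent) against the two cutoffs."""
    if rate <= low_cutoff:
        return "significantly low"
    if rate >= high_cutoff:
        return "significantly high"
    return "not significant"


rates = {"Part A": 8.2, "Part B": 1.6, "Part C": 12.4}
labels = {part: label_rate(rate) for part, rate in rates.items()}

for part, label in labels.items():
    print(f"{part}: {rates[part]}% is {label}")
```

The labels produced agree with Parts A, B, and C: not significant, significantly low, and significantly high.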

$$\tag*{$\blacksquare$}$$

Conclusion

Outliers are unusual data points that may represent errors, anomalies, or special cases. In normal distributions, a data point is considered a potential outlier if it lies two or more standard deviations away from the mean. Understanding how to detect outliers is crucial for accurate data interpretation and analysis.